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Abstract 

In spite of the recent surge of interest in quantile regression, joint estimation of linear 
quantile planes remains a great challenge in statistics and econometrics. We propose a 
novel parametrization that characterizes any collection of non-crossing quantile planes 
over arbitrarily shaped convex predictor domains in any dimension by means of uncon¬ 
strained scalar, vector and function valued parameters. Statistical models based on this 
parametrization inherit a fast computation of the likelihood function, enabling penalized 
likelihood or Bayesian approaches to model fitting. We introduce a complete Bayesian 
methodology by using Gaussian process prior distributions on the function valued param¬ 
eters and develop a robust and efficient Markov chain Monte Carlo parameter estimation. 
The resulting method is shown to offer posterior consistency under mild tail and regularity 
conditions. We present several illustrative examples where the new method is compared 
against existing approaches and is found to offer better accuracy, coverage and model ht. 


1. Introduction 


Quantile regression (QR; Koenker and Bassett, 1978; Koenker, 2005a) has recently gained 


increased recognition as a robust alternative to standard least squares regression, with ap¬ 


plications to ecology, economics, epidemiology and climate science research (Burgette et al. 


2011; Eisner et ah, 2008; Dunham et ah, 2002 Abrevaya, 2001). By offering direct infer¬ 


ence on the non-central parts of a response distribution, QR allows researchers to identify 
and quantify a wide range of regression heterogeneity where the predictors affect the quar- 
tiles or the tails of the response distribution differently than its mean or median. This is 
illustrated in Figure [^a), adapted from Koenker (2005b), showing the estimated conditional 

“Acceleration” 


1985) with 


quantile curves for the well-known motorcycle data (Silverman 
(head acceleration, in g) as the response and “Time” (time from impact, in ms) as the ex¬ 
planatory variable. The estimates do a much better job of capturing the complex relationship 
between the two variables than what could be inferred through a simple mean regression or 
from more modern nonparametric density regression techniques ( ]De lorio et al. , 2004 Tokdar 
2010) as shown in Figures [^b)-(c). 

The estimates in [ija) were generated by using the original linear quantile regression 


et al. 


technique of Koenker and Bassett] ( [l 978 [ ). For a response proportion r G (0,1) let Qy{t\X) = 
inf{a : P{Y < a\X) > r} denote the r-th conditional quantile of a response Y given a 
predictor vector X. The linear quantile regression model postulates 


Qy(r|Y) = /3o(r)+Y^/3(r), 


( 1 ) 
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Figure 1: Estimated quantile curves at r G {0.1,0.2,--- ,0.9} (gray lines) and at r G 
{0.25,0.5,0.75} (black lines) for motorcycle data (open circles). Single QR fits were done 
with rqssO function of the quantreg R-pcakage. Rearrangement was done by obtaining 
single QR fits over the dense grid r G {0.01,0.02, • • • , 0.99}. GP joint QR was from a pre¬ 


liminary implementation of the method described in Section 3.2. Linear DDP De lorio et al. 
(|2004) was implemented with the R-package DPpackage of Jara et al. (2011). 


which is equivalent to saying Y = I3o{t) + X'^I3{t) -|- U with the error variable satisfying 
Qu{t\X) = 0. The model is linear in the model parameters {f3o{T), f3{T)). The predictor 
vector X may include non-linear and interaction terms of the original covariates. In the 
motorcycle data analysis, we used B-spline transforms (df = 15) of Time as predictors, with 
dim(X) = 15. The model parameters are easily estimated by linear programming and the esti¬ 
mates are consistent, asymptotically Gaussian and robust against outliers. Gurrent literature 
on quantile regression (QR) is both deep and diverse; see Koenker (2005a) for a comprehensive 
overview and Tokdar and Kadane (2012) for references to Bayesian approaches. 

Most scientific applications of QR require inference over a dense grid of r values, which 


is usually done by assimilating inference from single-r model fits (e.g., Eisner et ah, 2008). 
Such assimilations are often problematic. In Figure [^a) the estimated curves cross each 
other violating laws of probability; the waviness and the local optima of the curves change 
wildly across r reflecting poor borrowing of information; all quantile curves nearly collapse 
to a single point at boundary, where uncertainty should have been high due to data scarcity. 


Post-hoc rearrangement of the estimated quantiles (Ghernozhukov et ah, 2011) avoids the 
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embarrassing issue of crossing (Figure [^d)), but the other two problems persist. 

Joint estimation of the conditional quantile planes requires working with the linear spec¬ 
ification Q simultaneously for all r G (0,1). These specifications together define a valid sta¬ 
tistical model, parametrized by function valued parameters /3o : (0,1) —>■ M, /3 : (0,1) —>■ 
provided 

/^o('^i) + > I3q{t2) + I3{t2), for every pair t\ > T 2 and for every x £ X (2) 


where fb is a pre-specified domain for X. Such models and related methods are a minority in 
the current quantile regression literature, and existing approaches have severe shortcomings. 


The methods by 

He 

(199^ 

n and 

Bondell et al. 

(2010 

) impose serious restrictions on the shape of 

/3(r). The procedurt 

3 by] 

Junson and Taylor 

2005 

, based on substitution likelihood, does not 


scale to dense r grids and the role of substitution likelihood in Bayesian estimation remains 
debated (Monahan and Boos 1992). Tokdar and Kadane (2012) provide a complete, scalable 
solution for the univariate case, but their handling of multivariate X through univariate, 
single index projection is unsatisfactory. 

To date, the most comprehensive treatment is given by Reich et al. ( 2011[ ) who utilize 
monotonicity properties of Bernstein basis polynomials with non-negative coefficients to en¬ 
sure non-crossing quantile planes in any dimension. Their use of truncated Gaussian prior 
distributions on the non-negative coefficients leads to an attractive Gibbs sampling based 
Bayesian model fitting. However, both the model and the computing algorithm of Reich 


et al. (2011) crucially depend on the predictor domain X being a hyper-rectangle in M^. This 
is a fairly major handicap that may lead to a poor fit for reasons explained below. 

The specification of T is a critical model choice in QR. Without loss of generality, X can 
be chosen convex because (j^ holds over X if and only if it holds over the convex hull of X. 
The convex hull of the observed predictor vectors presents the most obvious practical choice. 
In spite of convexity, such an X may have a fairly irregular shape and may occupy only 
a fraction of the volume of the encompassing hyper-rectangle. The B-spline transforms of 
Time in the motorcycle data analysis live on a tiny 1 dimensional manifold in Quantiles 
planes that are required to be non-crossing over the larger hyper-rectangle will appear mostly 
parallel within the original X, as can be seen in Figure [^e). Unfortunately, such narrow 
predictor convex hulls are unavoidable whenever non-linear effects are sought within the 
Koenker-Bassett program or the measured covariates are naturally correlated. These are 
also situations where assimilation techniques exhibit dramatic crossing problems and hence 
a sound statistical model is most needed for joint estimation. 

For an arbitrary convex X, the space of /Jq, /3 curves satisfying © is highly non-regular 


and unsuitable for statistical modeling and investigation. For the case of p = 1, Tokdar and 


Kadane (2012) provides a much simpler representation parametrized by two monotonically 
increasing curves over (0,1). A generalization of this to any p > 1 and any convex X of 
arbitrary shape is currently not available in the literature. 

In this paper we propose a novel theory that delivers the right modeling platform for 
joint quantile regression. Our theory covers any dimension p and any bounded convex X of 
arbitrary shape. It provides a complete characterization of joint quantile regression in terms 
of a collection of scalars, vectors and curves all but one of which are entirely constraint-free. 
Even the one curve with a constraint has only a mild shape restriction on it; it is required to 
live in the space of all CDFs on (0,1) with full support. Our reparametrization leads to an 
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easy likelihood score calculation in the model parameters, making it ideally suited to develop 
practicable methods by using either penalized likelihood or Bayesian techniques. 

We build upon this novel theory to introduce a semiparametric Bayesian methodology 
for joint quantile regression over any X and any p, where the curve valued model parameters 
are assigned Gaussian process and transformed Gaussian process priors within a hierarchical 
setting. Asymptotic frequentist properties of the method are studied in Section and we 
establish posterior consistency over a broad class of true data generating distributions with 
linear quantile curves. For parameter estimation, we propose a Monte Garlo technique that 
incorporates efficient model space discretization, adaptive Markov chain sampling (Haario 


et ah, 1999) and reduced rank approximation (Tokdar, 2007 Banerjee et al. 2008a). We 


provide empirical evidence (Section [5||6| ) that our Gaussian process method enjoys much better 
estimation accuracy and coverage than the method by Reich et al. (2011), and our estimates 
are comparable to regularized versions of the classical single-r estimates. We consider the 
developments here make a strong case for linear quantile regression to be used as a model 
based inferential method rather than just an exploratory tool! 


2. A novel theory of joint quantile planes estimation 


2.1 Characterizing non-crossing hyperplanes 


We focus only on the case where the response distribution is non-atomic and admits a prob¬ 
ability density function conditionally at every X, and hence (§ is equivalent to requiring 
$o{'r) + > 0 for all r G (0, l),x G X. Our theory could be extended to atomic re¬ 

sponse distributions with known atoms. Assume 0 is an interior point of X. This can be 
achieved without any loss of generality by a simple translation of the predictors once a suit¬ 
able interior point is found within the convex hull of the observed predictors; see Appendix 
B.l for more details. Dehne a map b i—>■ a(b, X) on MP U {oo} as 


a{b,X) = 


suPx&x{-x'^b}/\\b\\ 67 ^ 0 , 

00 6 = 0 . 


Note that for every 6 7 ^ 0 we have a( 6 , X) G (0, 00 ) because X is bounded with 0 as an interior 
point. 


Theorem 1. Let X be a bounded convex set in with zero as an interior point and let 
(3 o{t) and (3 {t) = (/3i(r),--- ,j5p{T)Y' be real, differentiable functions in t ^ (Ojl)- Then 
$o{t) + x'^${t) > 0 for all r G (0,1) at every x £ X if and only if 


Mt) > 0 , 


PiT) = /3o(t) 


_ vjT) 

a{v{T),X)^Jl -h ||u(r)||2’ 


TG (0,1), 


( 3 ) 


for some p-variate, real function v{t) = (ui(r), • • • ,Up(r))^ in t & ( 0 , 1 ). 

Proof. If part. Suppose ([^ holds. For any r G (0,1) either v{t) = 0 in which case /3(r) = 0 
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and so $o{t) + > 0 at every x £ X. Otherwise, if v{t) ^ 0 then at any x £ X. 


$o{t) +x'^P{t) = Pq{t) + 
= /3o('r) |l - 

> /3o(r) |l - 

> 0 


x'^v{t) 


a{v{T),X)^yi + ||n(r)P j 

/ \\v{T)f~{-x^v{'r)}/\\v{r) 

Vl + ||n(r)||2 a{v{T),X) 


\v{t) 


1 + ||n(r)||2 


Only if part. We must have /3o{t) > 0 for all r G (0,1) because X contains 0. For any 
r £ (0,1), if ${t) = 0, set v{t) = 0. Otherwise, $o{t) > {—x^${t)} at every x £ X and hence 
Mr) > ||/3(r)||a(^(r),df). So the positive scalar c(r) = [[/3o(T)/{||/3(r)||a(/3(r), 
satisfies ||,d(r)||a(/3(r), d:i)//3o(r) = c(r)/{l + c(r)2}V2, get 

v{t) = c{t)${t)/\\${t)\\, rG(0,1). 

A v{t) constructed as above defines a real p-variate function on r G (0,1) and satisfies @ . □ 


2.2 An almost constraint-free parametrization of linear quantile regression 

Theorem greatly reduces the monotonicity constraint on the quantile hyperplanes to that 
on a single function I 3 q{t). Construction of a single monotone function is a relatively easy 
task, but some care is needed in handling the range (/3o(0),/3o(l)), which corresponds to the 
support of the conditional density of Y given X = 0. We pursue a model for /3 o(t) based 
on a user specified or default “prior guess” /o(y) for this conditional density. In the special 
case where (/3o(0), /3o(l)) is a known finite interval, /o could be chosen with support equal 
to the same interval. In general /o should be chosen to have support (— 00 , 00 ), such as a 
standard normal density, or a Student-t density with a modest degrees of freedom if the 
response distribution is expected to have heavy tails. We focus only on this general case, 
although the model described below could be easily modified to /o supported on a bounded 
interval. 

Let /o have support (— 00 , 00 ) and define its cumulative distribution function Fo{y) = 
J -00 fo{z)dz, quantile function Qo{r) = Fq^{t) and quantile density qo{r) = Qo(r)- Let 
To = Fq{0); by the full support assumption, 0 < tq < 1. We pursue a model for /3o and (3 as 
follows 


Mro) = 70 , / 3 ( to ) = 7 
r-CO 


Mr) - Mro) = <r qo{u)du, r G ( 0 , 1 ) 

d Cirri) 

Pir) - /3(to) =cr [ 

Jc 


CCo) 

CM 


w{u) 


C(ro) a{w{u),X)^/lTfuMW 


qo{u)du, T £ (0,1) 


(4) 

(5) 

( 6 ) 


with model parameters 70 G M; 7 G cr > 0; rc : (0,1) —)> an unconstrained p-variate 

function on (0,1); and C : [0,1] —)• [0,1], a differentiable, monotonically increasing bijection, 
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i.e., a difFeomorphism, of [0,1] onto itself. We write {/3o,j3) = T{'yo,^,cr,'w,C) to indicate 
f3o,/3 defined as in Q-Q. 

All model parameters, except the diffeomorphism (^, are essentially unconstrained. The 
function space of C is simply the space of cumulative distribution functions associated with 
all probability densities with support [0,1]. Such function spaces are easy to handle for 
statistical model fitting; a simple approach is presented in Section 3. Note that when is the 
identity map of (0,1) onto itself and w{t) = 0, we get /3o = crQo, /3 = 0, and the resulting 
joint, linear quantile regression model simplifies to a standard homogeneous, linear model: 
Yi = lo + Xjl + crQ with e* ~ /q. This model indeed provides a complete representation of 
all /3o, /S satisfying the non-crossing condition (§, subject to a matching range criterion, as 
detailed below. 

Theorem 2. Let fio : (0,1) —)■ M, ,0 : (0,1) —)■ he differentiable with (/3o(0),/3o(l)) = 

(— 00 , 00 ) [defined in the limit]. Then (§ holds if and only if {/3o,fi) = T(7o, 7; <7, re, C) for 
some 7o G M, 7 G cr > 0, rc : (0,1) —>• and, a diffeomorphism from [0,1] onto itself 

[0,1]. 

Proof If (/0o,^) = T{-fo,'y,cr,w,C) then, 

$o{t) = 0-®(C(r))C(r), ${t) = i3o{t) ==, 

with v{t) = w{f{T)). Hence, by Theorem]^ we only need establish that any real, differen¬ 
tiable function fio on (0,1), with / 3 o ( t ) > 0 for all r G (0,1) and /0o(O) = 00, /0o(l) = 00, can 
be constructed as in ([^ for some diffeomorphism f : [0,1] —)> [0,1] and fi > 0. This is indeed 
true, since one could fix cr > 0 arbitrarily, and then take, 

C('r) = To ^70 + ^ T G (0,1), 

which is differentiable and monotonically increasing in (0,1) since ({t) = /o(7o + ifioiT) — 
7o)/cr)/3o(T) > 0 for all r G (0,1), and, C(0) = To(-oo) = 0, C(l) = To(oo) = 1. □ 

When /3o(0) or /3o(l) is finite (or both), we can still write fio as in but we need either 
C(0) > 0 or C(l) < 1 (or both). While such a does not strictly belong within our 

model space, they can be approximated arbitrarily well by a model element T(7o, 7, o', f, C), 
and consistently estimated from large samples (Lemma and Section . 


2.3 Likelihood evaluation 


A salient feature of a valid specification of Qy (r|x) for all r G (0,1) is that it uniquely defines 
the conditional response density /y(y|x) over x G A, given by 


frivlx) = 


WfQvifix) 


'r=Tx{y) 


where Tx{y) solves (5y(r|x) = y in r (Tokdar and Kadane, 2012). Consequently, one can 
define a valid log-likelihood score 


fY{yi\xi) = -'^log{$o(Txfiyi)) + xj fi^Txfiyi))}. (7) 
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in the model parameters based on observations {xi,yi), i = 1,... ,n. From Q-Q, we could 
write 


/3o(r) + x^/3(t) = (TqoiC{T))C{T) 


1 + 


u;(C(r)) 


a{w{CiT), X)^Jl + |k(C(r) 


and therefore a quick evaluation of the log-likelihood score is possible once we figure out 
Txi{yi) for each i = 1,..., n, by solving r = /^^{/3o(u) xf ^{u)}du. 

With enough resources, these numbers could be found up to any desired level of accuracy 
through standard numerical methods for integration and root finding. But for all practical 
needs, model fitting and inference could be restricted to a dense grid of r G {fi,... Xl}-, for 
which finding Tx^{yi) requires only a simple sequential search involving trapezoidal approxi¬ 
mations to the integral of /3 o(t) +x'^^{t). Algorithm presents a pseudo-code for likelihood 
evaluation involving only simple matrix and vector multiplication. The code runs extremely 
fast when implemented in any low-level programming language with quick “for loops”. In 
our numerical studies we used a C implementation which offered 1000 likelihood evaluations 
in 2 seconds on an Intel(R) Core(TM) i7-3770 machine with n = 1000, p = 7 and a grid over 
r with mesh size 0.01. 

A practical issue with a discrete grid of r is that it needs to cover the image of the data 
range mapped into the quantile space, while ensuring the grid length L remains manageable. 
In our implementations we chose a data dependent grid as follows. We used equispaced 
grid points between r = 0.01 and r = 0.99 with an increment of 0.01. Next, on the upper 
tail, we augmented the grid with new grid points 0.995, 0.9975, • • • until we covered r = 
1 — l/(2n) where n is the sampler size. Same augmentation strategy with geometrically 
reducing increment lengths were adopted on the lower tail to reach up to r = l/(2n). 


3. Bayesian inference with hierarchical Gaussian process priors 


3.1 Prior specification 

We adopt a Bayesian approach to parameter estimation with suitable prior distributions on 
the model parameters, including the function valued parameters C and w = {wi, ..., Wp). It is 
useful that WjS are completely unrestricted, allowing us to handle them with Gaussian process 
prior distributions. For handling (^, we first introduce a constraint free version wq : (0,1) —)■ M 
related to C through the “logistic transformation”: 


C(r) = 




re (0,1), 


( 8 ) 


and use a Gaussian process prior on wq] see Lenk (1988); Tokdar (2007) for similar uses in 
density estimation. 

Recall that a Gaussian process g = {^(t) : r G (0,1)} could be viewed as a random 
element of the Banach space of real valued functions on (0,1) equipped with the supremum 
norm. Every Gaussian process g is characterized by two functions, the mean function m(r) = 
Eg(T) and the non-negative definite covariance function c{t^t') = Cov(5(r),^(r')), and we 
use the label GP{m,c) to denote such a process. When g ~ GP{m,c), for any finite set of 
points {ri,..., r^} the random vector {g{Ti),... ,g{Tk)) has a A:-variate Gaussian distribution 
with mean (m(ri), • • • ,m{Tk))'^ and k x k covariance matrix with elements c{Ti,Tj). 
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Our prior specification can be expressed in the following hierarchical form: 


Wj 

r^GP{0,K]c^^{-, 

II 

• ,p 

(9) 

{kj , Xj) 

~ 7rk{K'j)TTxiXj), 

j = 0,...,p 


(10) 

2? 

o 

-J 

to 

~ vr(7o, 7, cr^) oc 

1 

AT’ 


(11) 


where t'| A^) = exp(—A^(r —r')^) is the so-called square exponential covariance function 
equipped with a rescaling parameter A (van der Vaart and van Zanten, 2008). This particular 
choice of the covariance function is motivated by two facts. First, for any fixed A > 0, the 
probability distribution GP{0, A)) assigns 100% probability to the set of all continuous 

functions on (0,1) and hence our prior specification does not a-priori rule out any valid 
specification of the joint linear QR model. Second, A plays the role of a bandwidth parameter 
for the sample paths generated from GP(0,c®®(-, jA)), with more wavy paths realized as A 
gets larger. In a seminal work, van der Vaart and van Zanten (2009) show that with a suitable 
prior distributions specified on A, the resulting rescaled square-exponential Gaussian process 
prior offers adaptively efficient estimation in nonparametric mean regression and density 
estimation problems by automatically adjusting A to attain optimal smoothing. 

For specifying ttx, it is more insightful to fix a small h > 0 and consider the quantity 
Ph(A) = exp(—/i^A^), which gives the correlation between wj^t) and Wj{T + h) given Xj = A, 
and assign ph{X) a Be{a\,bx) prior. In our applications we use h = 0.1, ax = 6 and bx = 4, 
which assigns 95% mass to po.i(-^) £ (0.3,0.86). However, in our experience, the method 
shows little sensitivity to these choices. We take vr^ to be IG{aK,bi^), the inverse gamma 
pdf with shape Ok and rate 6^- The inverse gamma choice allows us to integrate out all kj 
parameters at the time of model htting. In our applications, we use a^, = b^ = 3/2, which 
is small enough to ensure a reasonably diffuse marginal prior on each Wj while retaining a 
finite second moment. 

Our choice of vr^ and the right Haar prior on the location scale parameters (7o,7,cr^) is 
partially motivated by our numerical experimentations in which we found these choices to 
lead to estimates and credible intervals most similar to the Koenker-Basette estimates and 
confidence intervals. Other reasonable choices could be made and we discuss in Section [3 
choices that offer useful shrinkage properties. 

When no special information is available about the support of Y, we take /o to be a 
Student t-distribution with an unknown degrees of freedom parameter v and assign zz/6 a 
standard logistic prior distribution. The logistic prior is reasonably diffuse and helps the 
resulting method adapt well to a wide spectrum of tail behavior of the response distribution. 


3.2 Model fitting via discretization and adaptive blocked Metropolis 

With likelihood evaluation discretized over a grid of r values {fi,... Pl} as in Algorithm 
the curve valued parameters Wj, 1 < j < p are needed to be tracked only over the specified 
grid, reducing each curve to a parameter vector of length L. The same applies to wq from 
which and C could be obtained on the grid by using the trapezoidal rule of integration. 
While it is theoretically possible to fit the model by running a Markov chain Monte Carlo 
over these parameters vector and the other model parameters, such a strategy is not entirely 
practicable. The parameter vector derived from any Wj is conditionally an L dimensional 








Gaussian variable given \j and kj , and evaluating its log prior density requires factorizing or 
inverting a Lx L covariance matrix which has an 0{L^) computing complexity. Furthermore, 
a Markov chain sampler that operates on both these parameter vectors and the rescaling 
parameters XjS run into serious mixing problems. 

To overcome these difficulties, we use two sets of further discretization. First, we replace 
TTx with a dense, discrete approximation covering the range po.i{X) G (0.05,0.95). Let 
denote the approximating probability mass function with support points {A^,..., A^}. We 
choose the support points to be more densely packed for smaller A values, the rational behind 
this and the exact manner in which the grid is chosen are discussed in Appendix |B.2 

Next, we fix a set of uniformly spaced knots {t^,... , C [0,1], for some m much smaller 
than L and replace each Wj curve with 

Wjir) := E{wj{T)\wj{tl),... e ( 0 , 1 ), ( 12 ) 


which provides an interpolation approximation to Wj over (0,1), passing through the points 
k = l,...,m, and determined entirely by the m-dimensional vector Wj^ = 
, whose prior density evaluations require only 0{m^) flops. Such in¬ 
terpolation based low rank approximations to Gaussian process priors are widely used in 


statistics and machine learning literature, see for example, Snelson and Ghahramani (|2006|); 
Tokdar ( 2007| ; Banerjee et al.f (2008b). 


Our treatment here, however, differs slightly from the above papers in that we carry out 
the conditional expectation in (12) after marginalizing out both Aj and kj. Let Wj denote 
the L-dimensional vector {wj{ti),... ,Wj{tL))'^ that is needed for the likelihood evaluation. 
Then we can write. 


G 


*3 


9=1 


where Ag denotes the L x m matrix C, 


(A3)C„(Ag)-l with a*(Ag) = 
and C„(Ag) = ((c^®(t)',* and Pg{Wj^) oc 7rl{Xg)p{Wj^\Xg) with 


piWj.\Xg)<x7rliXg){l + 




(Ag)W 


-{aK+m/2) 


3 * 


2K 


r(a«; -h 

r(a,) 


the multivariate t-density of VFj* given Xj = Xg. Also notice that the marginal prior density 
of Wj^ is precisely Yl^=i'^*x{Xg)p{Wj^\Xg). 

With the help of the above sets discretization, our joint QR model is entirely determined 
by the (m -|- l)(p -|- 1) -|- 2 dimensional parameter vector 0 = ..., 70) 7^, i')'^ 

and model fitting may be carried out by running a Markov chain sampler on 9 followed by 
Monte Carlo approximations of posterior quantities. In our experience, an adaptive blocked 
Metropolis sampler has worked extremely well, offering fast mixing and reproducible results. 
For this sampler, we use p -|- 3 block updates of 0 per iteration of the sampler, where the first 
p -|- 1 blocks are given by (llj^,7j)'^, j = 0,... ,p and the last two blocks are (70,7'^)'^ and 
(loglog z^)^. For each block, we perform a random walk Metropolis update governed by a 
multivariate Gaussian proposal distribution centered at the current realization of the block 
and with covariance that is slowly adapted to resemble, up to a scaler multiplication, the 
posterior covariance matrix of the block, where the scaler multiplier is also adapted slowly to 
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achieve a pre-specified acceptance rate. We carry out these updates according to Algorithm 


4 in Andrieu and Thoms (2008) 


In our implementation, we precompute and save the matrix Ag and a Cholesky factor 
Rg of C^:^:{Xg) for every g = 1,. .. ,G and plug them into the likelihood and prior density 
evaluations during Markov chain sampling. The precomputation step adds little overhead 
cost but results in a big jump in computing speed by drastically reducing the computing 
time for each Markov chain iteration. 


4. Posterior consistency 

Frequentist justification of Bayesian methods are often presented in the form asymptotic 
properties of the posterior distribution. A basic desirable property is posterior consistency: 
the posterior mass assigned to any fixed neighborhood of the true data generating model 
element should converge to 1 in probability or almost surely as sample size goes to infinity. 
More refined evaluations of asymptotic properties emerge through posterior convergence rate 
calculations, where one considers a sequence of shrinking neighborhoods and calibrates the 
fastest rate of shrinkage for which the posterior mass assigned to these neighborhoods still 
converges to 1. 

We restrict only to a study of weak posterior consistency of the Gaussian process based 
QR method developed in this paper. For a formal treatment, we consider a stochastic design 
setting where AjS are drawn independently from a pdf fx on X. Since any valid specification 
of the quantile planes {Qy{t\x): t G (0,1), x G X} uniquely corresponds to a specification 
of conditional response densities {friulx) : y G 'R,x G X}, it also uniquely corresponds to 
a bivariate density function f{x,y) = fx{x)fY{y\x) under the stochastic design assumption. 
Hence our prior specification on the quantile planes induces a prior probability measure H 
on the space R of probability density functions on A x IK- If f*{x,y) = fxix)fY{y\x) is the 
true data generating element in this space, then the posterior is said to be weakly consistent 
at f* if Ii{U\{Xi,Yi),i = 1,... ,n) —)■ 1 almost surely for every weak neighborhood U of f* 
in R. 


The celebrated Schwartz Theorem (Schwartz, 1965) provides a fairly sharp sufficient con¬ 


dition for weak posterior consistency of H at f*. Let dKL{p,Q) ■= f P^og{p/q) denote the 
Kullback-Leibler (KL) divergence. For any f G J- and e > 0, let iFe(/) denote the e-KL neigh¬ 
borhood {5 G A : dxLif, ff) < e}- We say that f* is in the KL support of H if > 0 


for all e > 0. Schwartz (1965) proved 

Theorem 3 (Schwartz). The posterior is weakly consistent at f* G T if f* is in the KL 
support of n. 

We show that an f* with linear conditional quantiles Qy(r|x) = ^^{t) -|-x^/3*(t) belongs 
to the KL support of H under mild smoothness and tail conditions. Tail conditions are needed 
to ensure that d_ft:L(/y(-ja:), /y( jx)) < 00, which holds when /y(-|x) has tails decaying faster 
than those of /y(-|x), with / generated from H. With our choice of H, the tails of /y(-|x) are 
expected to be similar to those of /o, and hence, a minimum requirement is that the tails of 
/y(-|x) decay faster than those of /q. We make the notion of faster tail decay more precise 
with the following definitions. 

Definition 1. Let f be a probability density function on M with quantile function Q. Take 
m = Q{tq). All statements below are interpreted with respect a given /q. 
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1. We say f has a type I left tail if Q{0) > —oo, and, for every a > 0, 

-;> cl((t) G (0,oo), as 11 0, (13) 

with, CL{a) —>-0 as (T I 0. 

2. We say f has a type II left tail if for every a > 0, -fo{rn+ diverges 


to oo as t i 0 and. 


(14) 




with, UL{a) —)• 0 as cr I 0. 


3. Same definitions apply to the right tail, with, Q{l — t) replaeing Q{f) in (13), (14), and 
cr and Ur denoting the right tail counterparts of cl and ur. 


Recall that we have taken /o = fo{-\i^) = tjy with a prior on i/ G (0,oo). Notice that an / 
has a type I left tail with respect to any t,^, when supp(/) is bounded from below, which is 
same as saying <5(0) > —oo, and, f{y) is bounded away from zero near <5(0). If Q(0) > —oo 
but /(y) —)■ 0 as y —)■ <5(0) then / has a type II left tail with respect to any t^,. If Q(0) = —oo 
and f{y) decays to zero as y —)• —oo at a polynomial or faster rate, then, / has a type II 
left tail with respect to t^ for all i/ > 0 sufficiently small. It is straightforward to see that 
d-KL^f, fo) < oo whenever / has tails that are type I or type II with respect to /q. 

It turns out that a type I or II tail condition on /y(-|0), coupled with some regularity 
conditions on f3o,l3 are all that is needed to ensure consistency. Here is a precise statement. 


Theorem 4. Suppose (Iq, j3* are differentiable on (0,1). Also assume /3 */Pq can be extended 
to a continuous function on [0,1], and, there exists a cq > 0 such that $Q{t) + x'^$*{t) > 
cofo{t) for all t G (0,1). Then f* belongs to the KL support o/H whenever /y(-|0) has type 
I or II tails with respect to t^ for all small enough z/ > 0. 


A proof is given in Appendix 


A.2 


The two regularity conditions on (/3 q,/3* 


ensure 

that the conditional density functions do not exhibit pathological behaviors in the tails. 
Notice that the basic validity assumption f^it) + x'^$*{t) > 0 for all t G (0, 1) automatically 
guarantees that f3* (t) / ^^{t) is bounded for all t. To see this, notice that A must contain an 
open ball of radius r > 0 around origin which is an interior point. So, for any t G (0,1) with 
f*{t) / 0, u _:= -rp*{t)/\\P*{t)\\ G A, and hence, 0 < $o{t) +u^$*{t) = f^it) - r\\f 
and hence, \\f*{t)/$Q{t)\\ < Ijr. 


5. Numerical Experiments 
5.1 A small experiment with a triangular A. 

To illustrate why adjusting to the shape of A is important for joint QR estimation, we 
generated 200 synthetic observations from the model: 

X ~ Uniform(A); A = {x = (xi, ^2)^ G —1 < xi, X2 < 2, xi + X2 < 1}, 
Qy[r\X) = ^-(^g + ^^) Qjv(r|0, 1) + i±^+^Q^(r|l, 0.2^) (15) 
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Figure 2: Triangular X example. Left panel shows quantile planes and their extensions to 
the embedding rectangle. Other three panels show estimated coefficient curves. KB: classi¬ 
cal Koenker-Basette estimates; BP: Bayesian estimates from Bernstein polynomial model of 


Reich et al. (2011); GP: estimates from the proposed Gaussian process method. 


where QN{T\fi, ( 7 “^) denotes the r-th quantile of the distribution. The hyperplanes 


on the right hand side of (15) are correctly ordered on the triangular predictor space X, 


but cross each other inside the smallest embedding rectangle [—1, 2] x [—1,2], as seen on the 
left panel of Figure]^ This negatively impacts estimation by the Reich et al.| ( |20lT ) method 
(Figure]^, which cannot adapt to the triangular shape of the predictor space and is restricted 
to estimates that do not cross on the smallest rectangle enclosing all observed predictors. In 
contrast, our method, which works on the convex hull of the observed predictors, can retrieve 
the true parameter curves with a much higher accuracy. 


5.2 Performance assessment: univariate X 

For a thorough study of the frequentist performances of the proposed method, we simulated 
100 synthetic datasets each with n = 1000 observations from the model 


X ~ Uniform(—1,1); Qy{t\X) = 3(r — h) log 


1 


r(l -r] 


+ 4(r- 5)^ log 


1 


r(l -r 


X, 


and compared parameter estimation against the methods of Reich et al. (2011) and Koenker 


and Bassett (1978). Here X is one dimensional, and the shape of T is a not an issue. However, 


the nearly quadratic /3i(r) function is slightly challenging to estimate. 


We used the default setting for the method by Reich et al. (2011) with 5 basis functions. 


For each implementation, the Gibbs sampler was run for 10000 iterations and 200 samples 
from the second half of the chain were used for Monte Carlo. We also tried two other 
versions with 10 and 15 basis functions respectively. But increasing the number of basis 
functions resulted in a progressively poor performance, and thus we only report here the 
results from the 5 basis function setting. The QuantReg package in R was used to implement 


the classical method by Koenker and Bassett (1978) and conhdence intervals were constructed 


with 200 bootstrapped samples. For our Gaussian process method, we used 6 equispaced 
knots = (/c — l)/5, k = 1,...,6. We ran the adaptive blocked Metropolis sampler for 
10000 iterations with 10% burn-in and used 200 samples from the rest for Monte Carlo. 
Nearly identical results were obtained with 11 equispaced knots. 
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Figure 3: Assessing performance with univariate X. Left two panels show mean absolute 
estimation errors and right two panels show coverage by 95% confidence or credible bands. 
KB: classical Koenker-Basette estimates; BP: Bayesian estimates from Bernstein polynomial 
model of Reich et al. (2011); GP: estimates from the proposed Gaussian process method. 


Figure shows comparisons of pointwise mean absolute estimation errors of the three 
methods and also the coverage of the associated 95% confidence or credible bands, averaged 
across the 100 synthetic datasets. Our method offered lowest estimation errors over the entire 
range of r values, and a consistently high coverage close to the nominal target of 95%. It is 
important to remember that the credible bands produced by our method are calibrated in 
a Bayesian way, and so 95% credible bands are not automatically guaranteed to offer 95% 
coverage. 


5.3 Performance assessment: multivariate X 

For assessing performance in the multivariate case, we ran another simulation study with 
synthetic data generated from the model: 

X ~ Uniform({x G : ||x|| < 1}); (5y(r|X) = /3 o(t) + X'^I3{t) 


with /3o and (3 specified by the equations 


/3o(0.5) = 0, /3(0.5) = (0.96 -0.38 0.05 -0.22 -0.80 -0.80 -5.97) , 

r(l-r) + ||^(t -)||2 


where Uj(r) = Y2‘i=o^ij ' 1/(3^)), 1 < J < 7, with (/)(-|/i,denoting the 

density function and 

/ 0 0 -3 -2 0 5 -1\ 

a= -3 0 0 2 4 1 0 . 

\0 -2 2 2 -4 0 0/ 

These specifications define a valid model by Theorem because a{b,X) = 1 for any non-zero 
b when X is the unit ball centered at zero. Also note that with these specifications, (5y(r|0) 
is precisely the quantile function of the standard logistic distribution. For simulating an 
{X, Y) from the model we set X = UiZ/\\Z\\ and Y = Qy(U 2 \X) where t/i, t /2 ~ 17(0,1) and 
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Figure 4: Assessing performance with a 7 dimensional X. Top two rows show estimation 
errors and bottom two rows show coverage by 95% confidence or credible bands. KB: classi¬ 
cal Koenker-Basette estimates; BP: Bayesian estimates from Bernstein polynomial model of 


Reich et al. (2011); GP: estimates from the proposed Gaussian process method. 


Z ~ Ny{0 ,I7), drawn independently of each other. We evaluated each instance of Qy(?72|X) 
to a precision of 10“^® by numerically integrating and /3 between 0.5 and U 2 with the 
integrate() function in R. 

Figure [^compares the estimation error and coverage of the three methods averaged across 
100 datasets of size n = 1000 generated from the above simulation model. All three methods 
were set as in the previous example. Like before, our method again offered nearly lowest 
estimation errors and a consistently high coverage close to the nominal target of 95%, over 
the entire range of r values. 
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6. Case studies 


6.1 Plasma concentration of beta-carotene 


Nierenberg et al. (1989) presents a study of the association of beta-carotene plasma con¬ 
centrations with dietary intakes and drugs use for nonmelanoma skin cancer patients. The 
Statlib database (http://lib.stat.cmu.edu/datasets/Plasma_Retinol) hosts a subset of 
the data from 315 patients who had an elective surgical procedure during a three-year period 
to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was found 
to be non-cancerous. This dataset has been analyzed in the literature ( ]Kai et 2011) to 
assess how personal characteristics, smoking and dietary habits as well as dietary intake of 
beta-carotene affects concentration levels of beta-carotene in the plasma. 

We analyzed the same data with our joint QR model with plasma beta-carotene concen¬ 
tration (ng/ml) as the response and 11 covariates consisting of age (years), sex (l=Male, 
2=Female), smoking status (l=Never, 2=Former, 3=Current Smoker), Quetelet index or 
BMI (weight/(height^)), vitamin us^ (l=No, 2=Yes, not often, 3=Yes, fairly often), and 
daily consumption of calories, fat (g), fiber (g), alcohol (number of drinks), cholesterol (mg) 
and dietary beta-carotene (meg). These covariates gave rise to 13 predictors when the cate¬ 
gorical variables (sex, smoking status and vitamin use) were coded with dummy indicators. 
Estimated intercept and slope curves, with 95% credible bands are shown in Figure 

The estimated intercept curve strongly suggests a longer right tail for the response distri¬ 
bution. The slope curve estimates indicate that being female, use of vitamin and consump¬ 
tion of fiber have reasonably strong positive effect on plasma concentration of beta-carotene, 
whereas, smoking and BMI have reasonably strong negative effect. Calories, fat, alcohol or 
cholesterol consumption appears to have little effect. Dietary intake of beta-carotene ap¬ 
pears to have a positive effect, but the inference is not conclusive. The slope estimates in 
Figure suggest more dramatic effects of some predictors on the upper quantiles, but the 
credible bands paint a more modest picture. However, credible bands for /3j(0.9) — /3j(0.1) 
and /3j(0.9) — /3j(0.5), constructed directly from the posterior draws, indeed suggest more 
enhanced positive and negative effects, respectively for heavy vitamin use and BMI, on the 
upper quantiles (Table [^. 

We also performed a ten fold validation study to assess how well our joint model captured 
the intricacies of the beta-carotene data. In each fold of the study, we randomly partitioned 
the 315 observations into training and test sets at roughly 2:1 ratio. We fitted our joint model 
on the training data and obtained estimates j3j of /3j, j = 0,... ,p in the form of posterior 
means. These estimates were then used to evaluate the training and test data “check” loss at 
every r G {0.1,..., 0.9} by averaging pr(Y* — /3o('r) — Xj ${t)) over, respectively, all training 
and all test set observations (Yj, Yj), where Prir) = r{r — /(r < 0)}. The same was done 
with Koenker-Bassette, Reich et al.| ( |2011[ ) and standard least squares estimates. The relative 
accuracy of a method at any r was calculated as the reciprocal of its check loss at that r 
relative to the least square method. Figure shows these relative accuracy measures for the 
three quantile regression methods, averaged across the 10 repetitions. Our joint QR method 
can be seen to offer the best test data accuracy across all r values and maintain its advantage 
over least squares at the upper quantiles where the other two quantile regression methods 
appear to suffer a sharp loss of efficiency. 


^relabeled for better clarity as ‘3 — the original label’ 
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Figure 5: Parameter estimation for plasma data analysis. Green lines and bands give posterior 
means and pointwise 95% credible bands for intercept and slope curves, overlaid with the 
95% bootstrap confidence bands obtained from single-r Koenker-Bassett fits shown in black. 


6.2 Survival analysis under right censoring 

Joint estimation of quantile regression parameters could be particularly beneficial for survival 
analysis with censored response. A greater borrowing of information may help cover the 
information gaps left by censoring. A crossing-free estimation of the quantile functions means 
that the estimated survival curves are proper and interpretable. Also, a joint estimation offers 
an automatic way to quantify estimation uncertainty of the entire survival curves by simple 
inversions of estimated quantile functions. The probabilistic modeling framework of our joint 
quantile regression approach makes it particularly straightforward to handle right-censoring. 


16 































































j 

Predictor 

95% GI for /3j(0.9) -/3j(0.1) 

95% Cl for ,5j(0.9) - I3j 

1 

Age 

(-0.52,2.32) 

(-0.88,1.69) 

2 

Sex2 

(-43.34,54.12) 

(-33.53,55.31) 

3 

SmokStat2 

(-66.81,28.51) 

(-47.64,35.7) 

4 

SmokStat3 

(-102.75,26.64) 

(-95.05,19.97) 

5 

Quetelet 

(-5.43,0.12) 

(-4.93,-0.14) 

6 

VitUsel 

(-21.05,54.43) 

(-14.1,55.09) 

7 

VitUse2 

(12.27,177.02) 

(10.57,153.73) 

8 

Galories 

(-0.01,0.01) 

(-0.01,0.02) 

9 

Fat 

(-0.75,0.33) 

(-0.58,0.25) 

10 

Fiber 

(-3.7,3.14) 

(-3.22,2.54) 

11 

Alcohol 

(-0.74,2.32) 

(-0.72,1.51) 

12 

Gholesterol 

(-0.12,0.16) 

(-0.12,0.09) 

13 

BetaDiet 

(0,0.02) 

(-0.01,0.01) 


Table 1: Evidence of more dramatic upper tail effects of certain predictors on plasma beta- 
carotene concentration. Cl denotes posterior credible interval. 
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Figure 6: A 10-fold cross validation assessment of the fit of various linear quantile regression 
methods to plasma data with standard least squares regression being the benchmark. In held 
out test data, the proposed Gaussian process method (GP) offers better fit at all quantiles 
than Koenker-Bassette (KB) or the method by Reich et al. (2011| BP). 
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The log-likelihood score calculation 0 now changes to 
^[(1 - Ci) log frivilxi) -)-Cilog{l - Fyiyilxi)}] 


= ^ [cilog{l - T^iiyi)} - (1 - Ci) log {Po{rxiiyi)) + xf${Txiiyi))} , 

i 


(16) 


where c* is the censoring status (1= right censored, 0 = observed). With this single change, 
the same prior specification and Markov chain Monte Carlo parameter estimation as detailed 
in Section remain applicable. 

We illustrate these points with a reanalysis of the University of Massachusetts Aids Re¬ 
search Unit IMPACT Study data (UIS, Hosmer and Lemeshow, 1998, Table 1.3) in which we 
estimated the conditional quantiles of the logarithm of the time to return to drug use (Y) 
as linear functions of current treatment assignment (TREAT, 1 = Long course, 0 = Short 
course), number of prior drug treatments (NDT), recent intravenous drug use (IV3, 1 = Yes, 
0 = No), Beck depression score (BECK), a compliance factor measuring length of stay in the 
treatment relative to the course length (FRAC), race of the subject (RACE, 1 = Non-white, 0 
= White), age (AGE) and treatment site (SITE). For model fitting, we used the 575 complete 
observations available in the uis data set of the R package quantreg. Return times were right 
censored for 111 of these subjects. 

Figures show parameter and survival curves (for 9 randomly chosen subjects) estima¬ 
tion with our joint quantile regression approach and also with the censored quantile regression 
approach described in Koenker (2008). The latter was implemented by using the crq func¬ 
tion in R-package quantreg which uses a technique by Portnoy (2003). For joint estimation, 
we fixed the base probability density /o to be ^(0,1) instead of a t^, since the tails of the 
distribution of log return time are expected to be fast decaying. The two sets of parameter 
estimates are comparable, except in the upper tails. The Portnoy method fails to produce 
an estimate beyond r = 0.88, and confidence intervals get extremely wide for r close to this 
limit. In contrast, the credible bands from joint estimation are much more stable across the 
entire range of r. Estimated survival curves are remarkably similar, though for the Portnoy 
method, the issue of quantile crossing manifests in the form of estimated survival curves that 
are not strictly decreasing (e.g., subject # 313). 


7. Discussion 

We have introduced a complete and practicable theoretical framework for simultaneous esti¬ 
mation of linear quantile planes in any dimension and over arbitrarily shaped convex predic¬ 
tor domains. Although we have pursued here a specific estimation procedure, our modeling 
platform is extremely broad and parameter estimation could be done in a variety of other 
manners. For example, one could choose to use spline based estimation of the basic functions 
Wo,... ,Wp via penalized likelihood maximization or Bayesian averaging. Also, a variety of 
specifications could be used on the diffeomorphism parameter C, e.g., one could model C as 
a mixture of beta cumulative distribution functions, or try estimating directly by adding 
isotonic regression type constraints. 

A number of interesting features could be added to the Bayesian parameter estimation 
method we have pursued here. An important consideration is shrinkage for large p. For 
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Figure 7: Parameter estimation for UIS data analysis. Green lines and bands give posterior 
means and pointwise 95% credible bands for intercept and slope curves, overlaid with the 
95% bootstrap confidence bands obtained from the Portnoy approach. 


moderately large p, any standard shrinkage prior could be used on 7 , and the resulting 
posterior could be explored by the same Markov chain sampler as in Section 3.2 as long as 
the prior density on 7 is available in an explicit form up to a normalizing constant. Shrinkage 
could also be applied on the curve valued parameters Wj, j = 1 ,...,p, by choosing appropriate 
prior distributions on ..., Hp). An attractive choice is to replace the single gamma prior 
distribution we used in Sectionj^with a spike-slab type mixture of gamma distributions, e.g., 
~ 0.5Ga{aK,bK) + 0.5Ga(aK, 6 k/ 100). Such a specification still allows integrating out kj 
in (12) and hence could be explored by the same Markov chain sampler as before. 

The primary computational bottleneck of our method is that the likelihood evaluation 
involves a search over the grid of r values for each observation. While our current imple¬ 
mentation easily scales to thousands of observations, scaling it to even larger datasets will 
require further computing innovations. Fortunately, the likelihood evaluation is embarrass¬ 
ingly parallel in the observations and involves very simple arithmetic operations, and thus, 
it should be possible to obtain manyfold speed ups by the use of graphics processing units; 
such an implementation is currently underway. 
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Figure 8: Estimated survival curves for 9 random sampled subjects in the UIS study. Green 
lines are posterior draws of the survival curves. Blacks lines are the estimates from the 
Portnoy approach, and the dark red dashed lines are estimates under the Cox proportional 
hazard model. 


A. Technical details 

This section presents a proof of Theorem starting with a few fundamental results that 
allow comparing two probability density functions given information on their corresponding 
quantile density functions. We adopt the following notation in the remainder of this section: 
by a ‘probability function quartet’ we mean a four-tuple {Q,q,F,f) of real valued functions 
where Q : (0,1) —>■ M is a non-atomic quantile function that admits a strictly positive deriva¬ 
tive q = Q, F = Q~^ is the associated cumulative distribution function with probability 
density function f = F. Recall the identities f{y) = l/q{F{y)) and q{t) = 1/f{Q{t)). 

A.l Auxiliary results 

Lemma 5. Let (Qi, ( 71 , Fi,/i) and (Q 2 , 92 , F 2 ,/ 2 ) be two probability function quartets and 
take ruj = Qj{To), j = 1,2. If there exist O<C 1 <I<C 2<00 such that ciq 2 {t) < qi{t) < 
C 2 q 2 {t), for all t G (0,1), then, 

f 2 {y) = fi{y + Ai(y)) • A 2 {y), for all y E supp(/ 2 ), 
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for two real valued functions Ai, A 2 satisfying |Ai(y)| < max(l—ci, C 2 —l)|y—r?T, 2 | + |mi—m 2 |, 
and, A 2 {y) G [ 01 , 02 ]. 

Proof. By the assumption on qi, q 2 , for every y G supp(/ 2 ), 

oiy + 61 < QiF 2 {y) < 02 y + 62 , 


where hi = mi — cim 2 , 62 = ^1 — 02 m 2 . Then, Ai{y) := QiF 2 {y) — y satisfies, ||Ai(y)|| < 
max(l - Cl, C 2 - l)\y - m 2 \ + \mi - m 2 \. Since, fi{y) = l/qi{Fi{y)), i = 1,2, we have, for any 
y G supp(/ 2 ), fi{Qi{F 2 {y)))qi{F 2 {y)) = 1, and hence, 

_ MQmiymrnmy)) ^ ^ 

q2{F2[y)) q2[F2{y)) 


which proves the result. 


□ 


Lemma 6. Let f* satisfy the conditions of Theorem^^ Given any 5, a, ci, C 2 > 0, there exists 
an e > 0 such that , 


sup [ fYiy\x)\\og{fo{y/(T + A{y))/a}\dy < 6 

x£X J[Q^{e\x),QY{l-e\x)] 


for every A : M —)■ M satisfying |A(y)| < ci|y| + 02 for all y G M. 

Proof. By the assumption on /*, (t|x)/( 7 y (t|0) = 1 + x’^{t)/$^{1) is bounded away from 
zero and infinity. Hence, by Lemma[^ there are constants a, 6 > 0 such that for every x £ X, 
Qy(Ty(y|0)|x) = y + Ai^^^iy), with |Ai,a;(y)| < a|y| + b for all y £ Aq := supp(/y(-10)). Fix 
any x £ X. By the change of variable 2; = Qy(Fy(y|x)|0). 


/y(y|x) log/o 0 + A(y)) 


dy 


log/o 


Q*y{F*{zmx) 


= [ mm 

JAo 

= [ fYimmfo{z/a + A2{z))\dz, 

JAn 


+ AiQ*ymm\x)) 


dz 


with |A 2 ( 2 ;)| < \ Ai{z)\/a + |A( 2 ; + Ai( 2 ;))| < ai|z| + 61 where ai, 61 >0 depend only on 5, a. 
Cl and C 2 . The tail assumption on /y(-|0) implies that the last integral is finite, proving the 
result! □ 


Lemma 7. Fix 'yo G M, 7 G cr > 0, m : (0,1) —)> and two differentiable, monotonically 

increasing functions C, • [0)1] [0,1], with [C(0),C(1)] C [("1(0), Cl(l)] • Let {/3o,/3) = 

r(7o,7)U,m,C), (^J,/3l) = r(7d,7^u■,m,Cl), where, 

KHto) KHto) 

7 o = 7o + o-/ qo{u)du, 7 ! = 7 + a / qo{u)h{u)du, 

■ICHo) 

with h{T) := w{t) /{a{w{T), X)^l + l|u’('^)P}) 'T ^ (0, !)• Fix any x £ X and consider the 
probability function quartets (Qy (-lx), gy (-lx), Fy(-|x), fy{-\x)), (Qy (-lx), gy (-lx), ly(-p),/y(-p)) 
where Qy(r|x) = /3o{t) + x^^(t), Q^^plx) = /3d(r) +x^^l(r). Then, friylx)/fliy\x) = 
Cl(F^(y|x))/C(Fy(y|x)) for all y £ supp(/y(-lx)). 


21 










Proof. Let ri = C('^o) and, define, 


gro(T|x) = g'o(T){l + x'^/i(r)}, r G (0,1), x ^X. 

Then go(T|x) is strictly positive, and, hence, 

Qo{t\x) := / qo{u\x)du, r G (0,1), x G Af, 

J T1 

defines valid quantile planes on X. Denote the associated conditional distribution and density 
functions by Tnl-lx) and fo(-|x). By definition of oyfrlx) = Bnir) + x^/3(t) = 

o'®(C(t)) {l + x'^/i(C(r))} C(r) = ugo(C('r)|a;)C(r), and, hence. 


Qy(r|x) = 70 + x'^7 + / gy (u|x)du = 70 + x '^7 + uQoCCMk)- 

J TO 

Similarly, gy(T|x) = (xg'o(C^(T)|x)C^(r), and, hence, 

Q|^(r|x) = 7(5 + x'^ 7 '^ + f ql.{u\x)du = 'yo + x'^-f + aQo{C\T)\x). 

J TO 

by the definitions of 7 ( 5 , 7 "^. Inverting (17), we get, for every x G X, 

/Qy(r|x)- 70 -x ^7 \ 

C(r) = To ( - - - X 1 , r G (0,1). 


(17) 


(18) 


Therefore, if ?/ G (Qy (0|x), Qy (l|x)), then, 

T ( 1 ^ _ 1 ^ 

^ ^ gy(Fy(y|x)|x) aqo{C{FYiy\x))\x)C{FY{y\x)) (TC(Fy(y|x)) 

Similarly, /y(?/|x) = '^ \x)/{aCf{FY{y\x))}, proving the result! □ 


A.2 Approximating f* within assumed model space 

Let Hi/ denote the conditional prior distribution on / under 11 given v. Theorem]^ is proved 
in two stages. Let t'o > 0 such that the tails of /y(-|0) are of type I or II with respect to 
foi'W) for every 0 < v < uq. First we show that for any such v and any given <5 > 0, there 
exists an p G Ks{f*) within our model space with nicely behaved underlying Wj curves. 
Next we show ■ || log(/^^//)||oo < <5) > 0 which leads to the claim of Theorem]^ The 

following lemma gives a precise statement of the first step. 

Lemma 8. Let f* satisfy the conditions of Theorem^ For any small <5 > 0 and t) < v < vq, 
there exists an p G Ks{f*) associated with (/Iq,/?'!') = T{pQ.,p.,a'\w\p) where : [0,1] —)> 
MP is bounded continuous and : [0,1] —)> [0,1] is a diffeomorphism with PP £ [e~^,e^] 
for all t G [0,1] for some finite B > 0. 
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Proof. Fix a v € (0,t'o) a 6 £ (0,ro). All calculations below are carried out for this 
particular value of u and we suppress from the notation /o(-|i^). 

Let 7 q = /3q(to). Fix a ctl > 0 such that cl((Tl) < 1/2 if /y(-|0) has a type I left tail 
with respect to /o, or, UL{aL)log{l/uL{o'L)} < S/2 if the left tail is of type II. Similarly fix 
aR and take a* = min{(Ti, cjr}. Define : [0,1] —)• [0,1] as 

cm = Fo (70 + — 1], 


which is differentiable and monotonically increasing, and, whose derivative can be written as. 


cm 




pm-% \ 

a* J 


m) 


{l/a*)fo (70* + 

Wmm 


f G (0,1), 


since /3Q(t) = Qy(f|0). 

Because C* has a continuously differentiable inverse on [("*(0), C*(l)], the relation h*{C*{u)) = 
/i*( m)//3q( u) defines a map h* : [C*(0), C*(l)] —> 1^^ that is bounded and continuous by the 
assumption of Theorem and hence, can be extended to a bounded continuous function 
h* : [0,1] -£■ MP. Define w* : [0,1] —)• MP as follows, essentially repeating the construction in 
the “Only if part” of the proof of Theorem If h*{t) = 0 then set 'w*{t) = 0. Otherwise, 
take c{t) = [[l/{||/i*(t)||a(/i*(r), A)}]^ — 1 ]“L 2 gg^ w*{t) = c{t)h*{t)/\\h*{t)\\. By the 
assumption on P*/Pq, w* is a bounded continuous function on [0,1]. 

By construction (/3 q,/3*) = T( 7 o, 7 *, o'*, tc*, C*)- However this parameter vector may not 
be in our model space since we may have either [C*(0), C*(l)] / [0,1] or ||C*||oo = co- We 
correct this by introducing a proper diffeomorphism ("f on [0,1] with (/t bounded away from 
0 and infinity, such that C^(^) = C*(^) for t G [(5^, 1 — 5/?] for suitably chosen small numbers 
Sl,Sr > 0. This is the crux of the approximation argument. 

If the left tail is type I, then C*(0) > 0 and lim^oCC^) = ^^(ct*) G (0,1/2). So one 
can fix > 0 small enough such that C*{Sl) > Sr, C*(t) G (cl((T*)/2, 1] for all t G (0,5^], 
and, 5L\og[4:/{cL{a*)5i/\] < 5/2. Otherwise, the left tail is type II, and in that case choose 
5l = uiicr*), which automatically ensures C*(^) ^ 1 foi' a-H t ^ (0,1) with C*(^l) = Ij and, 

5l \og{l/5 l) < 5/2. Since C*(0) > 0, we also must have C*{5l) > Sr. Fix 5r by repeating the 
same steps with the right tail. Define : [0,1] —)■ M as. 


c\t) 


C*{C)-, t G [5l, 1 - 5r], 

< + bRt, t G [0, 5l) 

, I - aR{l-ty-bR{l-t), tG(l-(fK, 1], 


where. 


aL 


5lC{5l)-C*{5l) 

Si 


bi 


2C{5l)-5lC*{5l) 

5l 


5rC*{'^ - 5r) - {I - C*i'^ - 5 r)} ^ 2{1 - C*(l - (Jr)} - 5_rC*(1 - 

OR = -To-> Or = --. 

Sr Sr 

By choice of 5l and 5r, or < 0, or < 0 and 6 r G [C {5l),2/5l], bR G [C*(l - 5r),2/5r]. 
It is straightforward to verify that defines a diffeomorphism from [0,1] onto [0,1], with 
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C^{t) G ^l] for all t G [0,(^l] and C^(t) G [C*(l—dij), 6i{] for all t G [1 —<5/?,!]. Therefore 

there exists a B > 0 such that C^(^) £ ^ for ^ ^ [0; !]■ 

Take (/ 3 ^,^t) = ,cr*,w* , C^) with valid conditional quantile planes Q\r{-\x) and 

associated cumulative distribution and probability density functions given by Ty(-|x) and 
/y(-|x). By construction of Qy(r|x) = Qy(r|x) for all r G [(5l, 1 — and hence 
F^( 7 /|x) = Fy( 7 /|x) for all y G [Qy{6l\x), Qy(l - 5 _r|x)]. Hence, by Lemma[^ 


dxL(/y(-|x),/^(-|x)) 


' y&QY{5L\x),Qlr{l-5R\x)Y 


/y(?/|x)log 


ct(4(^ix)) ^ 

C*{F;.{y\x)) ^ 


Split the integral above into two integrals, one over y < Qy{6l\x) and the other over y > 
Qy(l - Sii\x). When y < Qy{5l\x)], both Fy(?/|a:) < 6l, and, Fy(y|x) < 5l- Clearly, 
C^(Ty(y|x)) <bL < 2/5l. If left tail is type I then, C*(Fy(y|x)) > CL{o'*)j2 and hence, 


rQyi^Llx) 

Iq*(o\x) 


I M CHfUvI^)) ^ ^ 

/y(y|x) log , ,, dy < 


C*iF*{y\x)) 


log 


r-Qy(<5i|a:) 


= Sl log 


(5lcl((T*)J Jq*^(0\x) 
4 


fYiy\x)dy 


5lcl{(7* 


< 5/2 


by the choice of 5l for the type I left tail. On the other hand, if the left tail is type II, then 
C*(Fy(y|x)) > I and hence 


rQy h) 
Iq*{0\x) 


frivlx) log < 5l log < 5/2, 


C*{F^{y\x)) 


2 

Tl 


again by the choice of 5l for this case. Same arguments apply to the integral over y G 
[Qy(l — (5 h|x), Qy(l|x)], and hence, for every x G T", ^ 5. Therefore 

dKL{f*,P) = f fx(x)dKL(fy(-lx), /^(•|x))dx <5. □ 


A. 3 Proof of Theorem 


Since the prior on i/ has full support, it suffices to show that given any 5 > 0 and ly < uq, 
the conditional prior Hi, = n(-|z^) assigns positive mass to the event {/ : dKiif*,/) < 3<5}. 
Fix any h > 0, and u < uq. By Lemma ^ there is a (/3 q,/3'I') = T{'^q,^\(j\w\C/^) with the 
associated probability density function /r satisfying dxLif*, P) < 5, where, : [0,1] —)■ 

[0,1] is a diffeomorphism with || logCt||^ < oo. 

+ , , ,, J. ^ 

Xapd\am.{X), Icr/cr'^ — 1| < A, u) : [0,1] —)■ is continuous and sup^ ||u)(t) — rc^(t)|| < A, 
C : [0,1] —>■ [0,1] is a diffeomorphism and ||logC ~ logClIoo < A. By construction and 


is bounded continuous, and, : [ 0 , 1 ] 

For any A > 0, let Ax denote the set of ( 70 , 7 , a, w, such that I 70 — 7 ol < A, II 7 — 7 '!^! 


because of the full support properties of Gaussian processes (Tokdar and Ghosh, 2007), 
the conditional prior Hi, assigns a positive mass to the set of / associated with (/3o)/3) = 
F{'fo,'y,cr,w,(), ( 70 , 7 , (T, tc, C) G Ax, for every A > 0. So, it suffices to show that A > 0 
could be chosen small enough such that any / associated with a ( 70 , 7 , u, tc, C) G Ax satisfies 
/ fyivlx) log{ fl{y\x)/fy{y\x)}dy < 25 for all x G T”. 
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Let b = \\ logC^lloo + 1- Since is bounded on [0,1], there exists a i? > 0 such that 
h^{t) := w^{t)/{a{w^t),X)^^/Y+~^w^t)^} satisfies 

1 + x'^h^t) G [e~®, e'®] , for all t G [0,1]. 

Clearly, there exists a Aq G (0,1/2) such that || logC — logC^Hoo < Aq implies || logCHoo < 26, 

and, supj|t(;(t) — w^t)\\ < Aq implies 1 + x'^wit)/{a{w{t), + ||rc(t)P} G 

for all t G [0,1]. Take ci = 1 + C 2 = ci{| 7 ( 5 | + ll^^ll + 1)} + and, ci = (1 + 2ci)/cr'l', 

C 2 = 2 (c 2 + | 7 o| + By Lemma there exists an 0 < e < 6 / max{126, 6 (i? + 1)} such 

that, 

sup [ fY{y\x)\logfo{y/a'^ + A{y))/a'<\dy<S/3. 

for every A : M —)> M satisfying |A(y)| <ci\y\+C 2 for all y G M 

Take any ( 70 , 7 , a, u;, C) G Ax^ and let (/3o,/3) = T{'yo,'l,cr,w,C), = T, w 

and, (/3 q,/3®) = T{'yQ,'y^,cr,w,e), where e denotes the identity function on [ 0 , 1 ] onto itself 

and 

/■TO rro 

70 = 7 o + / qo{u)du, = 7 ! + / qo{u)h\u)du, 

pro pro 

7 o= 7 o + o-/ qo{u)du, 'y^ = -f + a qo{u)h{u)du. 

■ICIto) ■lC('fo) 


with h{t) = w{t)/{a{w{t), X)x/'^ + ||r(;(t)P}, t G [0,1]. The definitions of 70 j7o)7o^ 7^^ 
match the requirements of Lemma[^ Let {QY{-\x),qY{-\X), Fy{-\x), fY{-\x)) denote the prob¬ 
ability function quartet of (/3o, /3), and the same symbols with appropriate superscripts denote 
the same quantities associated with the other three pairs (/3g,/3®), (/3 q,/ 3^) and (/3 q^,/3®^). 

Consider the following factorization in log-scale 


log 


fY{y\x) 




+ log 


fY{y\x) 


+ log 


fy{y\x) 
fyiy\x) ■ 


By Lemma[^ | log{fl{y\x)/fp{y\x)}\ = \ log Ct(i4(y|x))| < 26, and, | log{/f (y|x)//y(y|x)}| = 
I logC(Fy(y|x))| < 26. Since 1 + x'^h{t) G [e“^'®,e^'®] for all t G [0,1], we have, for any 
X € X, gy(t|x) = (i'y(t|0) • [e“^'®,e^^] for all t G (0,1), and hence, by Lemma/y(y|x) = 
/f (y-h Apa;(y)|0)A2,x(y), with \Ai^^ly)\ < ci\y\ + C 2 for all y G M, and, || log A 2 ,x||oo < 2B. 
But, /y(yio) = /o((y - 'yo)/a)/a and so, /yjy|x) = My/a'f + Ai^^{y))A 2 ,x{y)/(T'< with 
|Ai^ 2 ;(y)| A ci|y| -|- C 2 , for all y G M, and, || log A 2 ,a;||oo < B -|- 1. The same calculations work 
for fp because G Therefore, 


' [Ql.{e\x),Qp{l-e\x)]‘^ 


/y(y|x) log 


/y(yk) 


fY{y\x) 


dy < 6, 


for every x € X. 










The map {x,y) i—)• log/y(y|x) is equicontinuous on {{x,y) : x € X,y £ [Qy (e|x), Qy (1 — 
e|x)]}, and hence there exist a k > 0 such that log |/y (y + z|x)//y(y|x)| < 5/2 for all x G T, 
y £ Qy (1 “ kl < !'<■■ Fix a small 0 < rj < k/2 such that 

max(e’^ — 1,1 — e~'^) ■ sup maxjlQy(e|x) — 7o — \Qy{^ ~ “ 7o “ ^^7^1} < tt- 

x&x 2 

By the equicontinuity of the maps s i—)• log qo{e^) and s i—)• on the interval [log e, log(l — 

e)], and the continuity of the transformation v i—)■ v/{a{v, X')-sJ\ + ||u|p}, one can fix 0 < A < 
min(Ao, k/ 4) such that for any (70,7, cr, re, (") G 

qY{t\x) _ a qoiCit)) „ 1 + x'^/i(C(t)) CW r r,n 

for every t G [e, 1 — e] and x G T. Consequently, by Lemma for every x £ X and y £ 
[Qy(e|x), Qy(l — ejx)], | log{/y(y|x)//y(y|x)}| < 5 . This proves the result. 

B. Computational details 


B.l Centering the predictors 

A preprocessing step of our method is to center the observed predictors {xi,..., x^} around 
an interior point of their convex hull (Figure]^. While the sample mean vector automatically 
gives an interior point, it may lie too close to the hull boundary and lead to poorer model 
fit. A better strategy is to use the mean of the extreme points of the data cloud, but finding 
the extreme points becomes computationally intensive for p > 2. Instead, we employ a fast 
algorithm that recursively identifies p + 1 points x\,... ,x*y;^, from the data cloud that are 
close to the boundary and far away from each other. 

Consider a Gaussian process / on with covariance function C{x, x') = exp{—||A~^(x — 
x')|p}, where A is the pxp diagonal matrix with j-th element equaling the observed range of 
the j-th predictor, j = 1,... ,p. Take x^ = xi and recursively select x* as the x G {xi,..., Xn} 
with maximum Var(/(x)|xy ..., x^_^), j = 2,...,p-|- 1. This recursive selection can be 
carried out extremely fast, with computational complexity of the order (p + l)nlogn flops, 
by carrying out a rank-(p -|- 1) incomplete, pivoted Cholesky decomposition of the n x n 
non-negative definite matrix K = {(C{xi,Xj))), for example, by using the inchol function 
of the R package kernlab. Such implementations depend on the order in which the XjS are 
stored. To encourage selection close from the boundary, we prearrange the XjS in decreasing 
order of their Mahalanobis distance ||5'“^(xj — x)|| from mean x, where S denotes the sample 
covariance. 

B.2 Choosing A grid points 

In choosing the grid points A^, y = 1,..., G, for Aj, it is important to ensure that the condi¬ 
tional prior distributions ^"(0, KjG**(Ag)) remain sufficiently overlapped for neighboring Xg 
values, since otherwise, the grid based discretization of the prior on A may lead to poor mix¬ 
ing of the Markov chain sampler. If overlap is measured by the Kullback-Leibler divergence 
d{X,X') := fii('L(A(0, k^G**(A)), A(0, k|C'**(A'))), which does not depend on kj, it is easy to 
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input : Model parameters 
scalars; 70, o', 
vectors: 7 = (71, • • • ,7^)^, 

functions: w = {wi, ■ ■ ■ , Wp) : ( 0 ,1) —)• diffeomorphism ( on [ 0 ,1] 

output: Log-likelihood score 

1 // Basic quantities 

2 for ^ = 1 : m set 60 ,i 

3 for I = 1 : m // Could be parallelized in I 

4 set vi = uj{C{ti)); 

5 for i = 1 : n calculate ai = xjvf, 

6 calculate ax = maxi<j<„{—aj}/y^||uj||; 

7 for i = 1 : n set an = ai/{ax ■ s/l+Jvi^}; 

8 endfor 

9 // Calculatelog likelihood score by sequencing through obs 

10 set = 0 ; // initialize the log likelihood 

11 for i = 1:n // Could be parallelized in i 

12 calculate Qo = 7o + 

13 if Li > Qo then 

14 set Qu = Qo and I = k; 

15 while Yi > Qu do 

16 set Qu = Qu and 1 = 1 + 1 -, 

17 if I < m then calculate 

Qu = Ql + (<J/2) • {6o,/-i(l + +,i-i) + bo,i{l + hi,/)}; 

18 else set Qu = 00 

19 end 

20 else 

21 set Ql = Qo and / = A: -|- 1 ; 

22 while Yi < Qu do 

23 set Qu = Ql and 1 = 1 — 1-, 

24 if I > 2 then calculate 

Ql = Qu — (h/2) • {6o,/-i(l + hi,/-i) + ho, /(I + hi,/)}; 

25 else set Ql = — 00 ; 

26 end 

27 end 

28 if Ql = —00 or Qu = 00 then 

29 I set U = —00 

30 else 

31 calculate a = (Fi - Ql)/(Q;7 - Ql); 

32 set U = it- log{(l - a)6o,/-i(l + hi,/_i) oho,/(I + h*,/)} 

33 end 

34 endfor 

35 return ll 

Algorithm 1; Log-likelihood evaluation 
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Figure 9: A toy demonstration of finding an interior point of the convex hull of observed 
predictors. The observed predictor vectors are 9 dimensional B-spline transforms of 500 
uniform random drawn from [0,1], shown as black dots (with some jitter to improve visibility) 
in the pairwise plots. The red X denotes the projection of the sample mean, and the large 
blue dot denotes the projection of the interior point found by our preprocessing method. The 
green crosshairs are the selected 10 points. 
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see that one must use a non-uniform grid of A values since for a given A > 0, d{X, A -|- A) 
is much larger for a small A than a large one. To choose this non-uniform grid, we set Ai to 
be the smallest value in the predetermined range, one that gives po.i(Xi) = 0.99, and then 
increment A recursively so that d(Ag-i, Xg) = 1, g = 2,3,.. until the whole range is covered. 
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